Abstract. Let $\mathcal{F}$ be a set of $M$ classification procedures with values in
$[-1,1]$. Given a loss function, we want to construct a procedure which
mimics at the best possible rate the best procedure in $\mathcal{F}$. This fastest
rate is called the optimal rate of aggregation. Considering a continuous scale
of loss functions with various types of convexity, we prove that optimal
rates of aggregation can be either $((\log M)/n)^{1/2}$ or $(\log M)/n$. We prove
that, if all the $M$ classifiers are binary, the (penalized) Empirical Risk
Minimization procedures are suboptimal (even under the margin/low
noise condition) when the loss function is somewhat more than convex,
whereas, in that case, aggregation procedures with exponential weights
achieve the optimal rate of aggregation.

1 Introduction 

Consider the problem of binary classification. Let $(\mathcal{X}, \mathcal{A})$ be a measurable space.
Let $(X, Y)$ be a couple of random variables, where $X$ takes its values in $\mathcal{X}$ and
$Y$ is a random label taking values in $\{-1, 1\}$. We denote by $\pi$ the probability
distribution of $(X, Y)$. For any function $\phi : \mathbb{R} \to \mathbb{R}$, define the $\phi$-risk of a real
valued classifier $f : \mathcal{X} \to \mathbb{R}$ by

$$A^{\phi}(f) = E[\phi(Y f(X))].$$

Many different losses have been discussed in the literature over the last decade
(cf. [10, 13, 26, 14, 6]), for instance:

$\phi_0(x) = \mathbb{1}_{(x \le 0)}$   classical loss or 0-1 loss

$\phi_1(x) = \max(0, 1 - x)$   hinge loss (SVM loss)

$x \mapsto \log_2(1 + \exp(-x))$   logit-boosting loss

$x \mapsto \exp(-x)$   exponential boosting loss

$x \mapsto (1 - x)^2$   squared loss

$x \mapsto \max(0, 1 - x)^2$   2-norm soft margin loss

We will be especially interested in losses having convexity properties, as
considered in the following definition (cf. [17]).
Definition 1. Let $\phi : \mathbb{R} \to \mathbb{R}$ be a function and $\beta$ be a positive number. We
say that $\phi$ is $\beta$-convex on $[-1, 1]$ when

$$[\phi'(x)]^2 \le \beta \phi''(x), \quad \forall |x| \le 1.$$

For example, logit-boosting loss is $(e/\log 2)$-convex, exponential boosting loss
is $e$-convex, squared and 2-norm soft margin losses are 2-convex.
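As an illustration of Definition 1, the following sketch estimates the smallest admissible $\beta$ on a grid of $[-1, 1]$ for the exponential and logit-boosting losses, whose derivatives are written out by hand. The helper name `smallest_beta` and the grid size are assumptions made for this example; the check also assumes a strictly positive second derivative on the interval.

```python
import math

def smallest_beta(d1, d2, grid=2001):
    """Largest value of d1(x)**2 / d2(x) over a regular grid of [-1, 1].

    d1 and d2 are the first and second derivatives of the loss; d2 is
    assumed to be strictly positive on [-1, 1].
    """
    xs = [-1 + 2 * i / (grid - 1) for i in range(grid)]
    return max(d1(x) ** 2 / d2(x) for x in xs)

# Exponential boosting loss x -> exp(-x).
def exp_d1(x):
    return -math.exp(-x)

def exp_d2(x):
    return math.exp(-x)

# Logit-boosting loss x -> log2(1 + exp(-x)).
def logit_d1(x):
    return -1.0 / (math.log(2) * (1 + math.exp(x)))

def logit_d2(x):
    return math.exp(x) / (math.log(2) * (1 + math.exp(x)) ** 2)

beta_exp = smallest_beta(exp_d1, exp_d2)
beta_logit = smallest_beta(logit_d1, logit_d2)
print(beta_exp, beta_logit)
```

In both cases the ratio $[\phi'(x)]^2/\phi''(x)$ is decreasing in $x$, so the maximum is attained at $x = -1$, which is a grid point.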

We denote by $f^*_\phi$ a function from $\mathcal{X}$ to $\mathbb{R}$ which minimizes $A^\phi$ over all real-
valued functions and by $A^\phi_* \stackrel{\rm def}{=} A^\phi(f^*_\phi)$ the minimal $\phi$-risk. In most of the cases
studied $f^*_\phi$ or its sign is equal to the Bayes classifier

$$f^*(x) = \mathrm{sign}(2\eta(x) - 1),$$

where $\eta$ is the conditional probability function $x \mapsto P(Y = 1 | X = x)$ defined
on $\mathcal{X}$ (cf. [3, 26, 34]). The Bayes classifier $f^*$ is a minimizer of the $\phi_0$-risk (cf.
[11]).

Our framework is the same as the one considered, among others, by [27, 33, 7]
and [20, 17]. We have a family $\mathcal{F}$ of $M$ classifiers $f_1, \ldots, f_M$ and a loss function
$\phi$. Our goal is to mimic the oracle $\min_{f \in \mathcal{F}}(A^\phi(f) - A^\phi_*)$ based on a sample $D_n$ of
$n$ i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ of $(X, Y)$. These classifiers may have
been constructed from a previous sample or they can belong to a dictionary of
simple prediction rules like decision stumps. The problem is to find a strategy
which mimics as fast as possible the best classifier in $\mathcal{F}$. Such strategies can
then be used to construct efficient adaptive estimators (cf. [27, 22, 23, 9]). We
consider the following definition, which is inspired by the one given in [29] for
the regression model.

Definition 2. Let $\phi$ be a loss function. The remainder term $\gamma(n, M)$ is called
optimal rate of aggregation for the $\phi$-risk, if the following two inequalities
hold.

i) For any finite set $\mathcal{F}$ of $M$ functions from $\mathcal{X}$ to $[-1, 1]$, there exists a statistic
$\tilde f_n$ such that for any underlying probability measure $\pi$ and any integer $n \ge 1$,

$$E[A^\phi(\tilde f_n) - A^\phi_*] \le \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + C_1 \gamma(n, M). \quad (1)$$

ii) There exists a finite set $\mathcal{F}$ of $M$ functions from $\mathcal{X}$ to $[-1, 1]$ such that for
any statistic $\bar f_n$ there exists a probability distribution $\pi$ such that for all
$n \ge 1$

$$E[A^\phi(\bar f_n) - A^\phi_*] \ge \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + C_2 \gamma(n, M). \quad (2)$$

Here $C_1$ and $C_2$ are absolute positive constants which may depend on $\phi$. More-
over, when the above two properties i) and ii) are satisfied, we say that the
procedure $\tilde f_n$, appearing in (1), is an optimal aggregation procedure for
the $\phi$-risk.



The paper is organized as follows. In the next Section we present three aggre- 
gation strategies that will be shown to attain the optimal rates of aggregation. 
Section 3 presents performance of these procedures. In Section 4 we give some 
proofs of the optimality of these procedures depending on the loss function. In 
Section 5 we state a result on suboptimality of the penalized Empirical Risk 
Minimization procedures and of procedures called selectors. In Section 6 we give 
some remarks. All the proofs are postponed to the last Section. 



2 Aggregation Procedures 

We introduce procedures that will be shown to achieve optimal rates of aggre-
gation depending on the loss function $\phi : \mathbb{R} \to \mathbb{R}$. All these procedures are
constructed with the empirical version of the $\phi$-risk and the main idea is that
a classifier $f_j$ with a small empirical $\phi$-risk is likely to have a small $\phi$-risk. We
denote by

$$A_n^\phi(f) = \frac{1}{n} \sum_{i=1}^n \phi(Y_i f(X_i))$$

the empirical $\phi$-risk of a real-valued classifier $f$.

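The Empirical Risk Minimization (ERM) procedure is defined by

$$\tilde f_n^{ERM} \in \mathrm{Arg}\min_{f \in \mathcal{F}} A_n^\phi(f). \quad (3)$$

This is an example of what we call a selector, which is an aggregate with values
in the family $\mathcal{F}$. Penalized ERM procedures are also examples of selectors.
The Aggregation with Exponential Weights (AEW) procedure is given by

$$\tilde f_n^{AEW} = \sum_{f \in \mathcal{F}} w^{(n)}(f) f, \quad (4)$$

where the weights $w^{(n)}(f)$ are defined by

$$w^{(n)}(f) = \frac{\exp(-n A_n^\phi(f))}{\sum_{g \in \mathcal{F}} \exp(-n A_n^\phi(g))}, \quad \forall f \in \mathcal{F}. \quad (5)$$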
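The two procedures just defined can be sketched in a few lines: ERM selects the member of the family with the smallest empirical $\phi$-risk, while AEW averages the family with weights proportional to $\exp(-n A_n^\phi(f))$. The function names, the toy sample and the three decision stumps below are assumptions made for illustration; subtracting the largest score before exponentiating is only a numerical safeguard and leaves the normalised weights unchanged.

```python
import math

def empirical_risk(loss, f, sample):
    """Average of loss(y * f(x)) over the sample of (x, y) pairs."""
    return sum(loss(y * f(x)) for x, y in sample) / len(sample)

def erm(loss, family, sample):
    """Selector: a member of the family minimising the empirical risk."""
    return min(family, key=lambda f: empirical_risk(loss, f, sample))

def aew_weights(loss, family, sample):
    """Exponential weights proportional to exp(-n * empirical risk)."""
    n = len(sample)
    scores = [-n * empirical_risk(loss, f, sample) for f in family]
    top = max(scores)  # shift for numerical stability
    raw = [math.exp(s - top) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

def aew(loss, family, sample):
    """Aggregate: convex combination of the family with the AEW weights."""
    weights = aew_weights(loss, family, sample)
    return lambda x: sum(w * f(x) for w, f in zip(weights, family))

def hinge(u):
    return max(0.0, 1.0 - u)

# Hypothetical toy data and decision stumps on the real line.
sample = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.5, 1), (2.5, 1)]
family = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: -1,
]

best = erm(hinge, family, sample)
weights = aew_weights(hinge, family, sample)
aggregate = aew(hinge, family, sample)
print(weights, aggregate(1.0))
```

Unlike the selector, the AEW aggregate generally takes values strictly inside $[-1, 1]$ even when every member of the family is binary.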



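The Cumulative Aggregation with Exponential Weights (CAEW) proce-
dure is defined by

$$\tilde f_{n,\beta}^{CAEW} = \frac{1}{n} \sum_{k=1}^n \tilde f_{k,\beta}^{AEW}, \quad (6)$$

where $\tilde f_{k,\beta}^{AEW}$ is constructed as in (4) based on the sample $(X_1, Y_1), \ldots, (X_k, Y_k)$
of size $k$ and with the 'temperature' parameter $\beta > 0$. Namely,

$$\tilde f_{k,\beta}^{AEW} = \sum_{f \in \mathcal{F}} w_\beta^{(k)}(f) f, \quad \text{where} \quad w_\beta^{(k)}(f) = \frac{\exp(-\beta^{-1} k A_k^\phi(f))}{\sum_{g \in \mathcal{F}} \exp(-\beta^{-1} k A_k^\phi(g))}, \quad \forall f \in \mathcal{F}.$$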
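A minimal sketch of the CAEW construction follows. Because each AEW aggregate is linear in the members of the family, averaging the aggregates built on the first $k$ observations is the same as averaging their weight vectors, which is what the code does. The function names, the toy data and the choice of the loss $x \mapsto x^2 - x + 1$ with temperature $\beta = 4.5$ are assumptions made for this example.

```python
import math

def tempered_weights(loss, family, sample, beta):
    """Weights proportional to exp(-k * A_k(f) / beta) on a sample of size k."""
    # k * A_k(f) is the cumulated loss of f over the sample.
    scores = [-sum(loss(y * f(x)) for x, y in sample) / beta for f in family]
    top = max(scores)  # shift for numerical stability
    raw = [math.exp(s - top) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

def caew_weights(loss, family, sample, beta):
    """Average over k = 1..n of the weights built on the first k observations."""
    n = len(sample)
    averaged = [0.0] * len(family)
    for k in range(1, n + 1):
        w_k = tempered_weights(loss, family, sample[:k], beta)
        averaged = [a + w / n for a, w in zip(averaged, w_k)]
    return averaged

def caew(loss, family, sample, beta):
    """CAEW aggregate as a function of x."""
    weights = caew_weights(loss, family, sample, beta)
    return lambda x: sum(w * f(x) for w, f in zip(weights, family))

def quadratic_loss(u):
    return u * u - u + 1.0

# Hypothetical toy data and decision stumps on the real line.
sample = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.5, 1), (2.5, 1)]
family = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: -1,
]

weights = caew_weights(quadratic_loss, family, sample, beta=4.5)
aggregate = caew(quadratic_loss, family, sample, beta=4.5)
print(weights, aggregate(1.0))
```

This direct version recomputes the cumulated losses for every prefix, which costs $O(n^2 M)$ loss evaluations; keeping running sums would reduce it to $O(nM)$.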



The idea of the ERM procedure goes back to Le Cam and Vapnik. Exponential
weights have been discussed, for example, in [2, 15, 19, 33, 7, 25, 35, 1] or in [32, 8] in
the on-line prediction setup.



3 Exact Oracle Inequalities. 

We now recall some known upper bounds on the excess risk. The first point of
the following Theorem goes back to [31], the second point can be found in [15] or [8]
and the last point, dealing with the case of a $\beta$-convex loss function, is Corollary
4.4 of [22].

Theorem 1. Let $\phi : \mathbb{R} \to \mathbb{R}$ be a bounded loss function. Let $\mathcal{F}$ be a family of
$M$ functions $f_1, \ldots, f_M$ with values in $[-1, 1]$, where $M \ge 2$ is an integer.

i) The Empirical Risk Minimization procedure $\tilde f_n = \tilde f_n^{ERM}$ satisfies

$$E[A^\phi(\tilde f_n) - A^\phi_*] \le \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + C \sqrt{\frac{\log M}{n}}, \quad (7)$$

where $C > 0$ is a constant depending only on $\phi$.
ii) If $\phi$ is convex, then the CAEW procedure $\tilde f_n = \tilde f_n^{CAEW}$ with "temperature
parameter" $\beta = 1$ and the AEW procedure $\tilde f_n = \tilde f_n^{AEW}$ satisfy (7).
iii) If $\phi$ is $\beta$-convex for a positive number $\beta$, then the CAEW procedure with
"temperature parameter" $\beta$ satisfies

$$E[A^\phi(\tilde f_{n,\beta}^{CAEW}) - A^\phi_*] \le \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + \beta \frac{\log M}{n}.$$



4 Optimal Rates of Aggregation. 

To understand how the optimal rate of aggregation depends on the
loss, we introduce a "continuous scale" of loss functions indexed by a non-negative
number $h$,

$$\phi_h(x) = \begin{cases} h \phi_1(x) + (1 - h) \phi_0(x) & \text{if } 0 \le h \le 1, \\ (h - 1) x^2 - x + 1 & \text{if } h > 1, \end{cases}$$

defined for any $x \in \mathbb{R}$, where $\phi_0$ is the 0-1 loss and $\phi_1$ is the hinge loss.

This set of losses is representative enough since it describes different types
of convexity: for any $h > 1$, $\phi_h$ is $\beta$-convex on $[-1, 1]$ with $\beta \ge \beta_h \stackrel{\rm def}{=} (2h - 1)^2/(2(h - 1)) \ge 2$,
for $h = 1$ the loss is linear and for $h < 1$, $\phi_h$ is non-convex.
For $h \ge 0$, we consider

$$A_h(f) \stackrel{\rm def}{=} A^{\phi_h}(f), \quad f_h^* \stackrel{\rm def}{=} f^*_{\phi_h} \quad \text{and} \quad A_h^* \stackrel{\rm def}{=} A^{\phi_h}_* = A^{\phi_h}(f_h^*).$$
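The scale of losses can be written down directly, which also gives a quick numerical check of the constant $\beta_h = (2h-1)^2/(2(h-1))$ against Definition 1. In the sketch below the function names are assumptions, and the 0-1 loss is taken to charge $x \le 0$ (the convention at $x = 0$ is an assumption of this example).

```python
def phi_h(h, x):
    """Loss of index h >= 0 evaluated at x."""
    if h < 0:
        raise ValueError("h must be non-negative")
    if h <= 1:
        zero_one = 1.0 if x <= 0 else 0.0   # 0-1 loss (assumed convention at 0)
        hinge = max(0.0, 1.0 - x)           # hinge loss
        return h * hinge + (1 - h) * zero_one
    return (h - 1) * x * x - x + 1

def beta_h(h):
    """Closed-form convexity constant for h > 1."""
    if h <= 1:
        raise ValueError("beta_h is defined for h > 1 only")
    return (2 * h - 1) ** 2 / (2 * (h - 1))

def beta_h_numeric(h, grid=2001):
    """Grid maximum of phi_h'(x)**2 / phi_h''(x) over [-1, 1], for h > 1."""
    xs = [-1 + 2 * i / (grid - 1) for i in range(grid)]
    second = 2 * (h - 1)
    return max((2 * (h - 1) * x - 1) ** 2 / second for x in xs)

print(phi_h(0.5, -0.5), phi_h(2.0, 1.0), beta_h(2.0), beta_h_numeric(2.0))
```

For $h > 1$ the squared derivative $(2(h-1)x - 1)^2$ is largest at $x = -1$ on $[-1, 1]$, which is where the closed form comes from.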



Theorem 2. Let $M \ge 2$ be an integer. Assume that the space $\mathcal{X}$ is infinite.

If $0 \le h < 1$, then the optimal rate of aggregation for the $\phi_h$-risk is achieved
by the ERM procedure and is equal to

$$\sqrt{\frac{\log M}{n}}.$$

For $h = 1$, the optimal rate of aggregation for the $\phi_1$-risk is achieved by the
ERM, the AEW and the CAEW (with 'temperature' parameter $\beta = 1$) procedures
and is equal to

$$\sqrt{\frac{\log M}{n}}.$$

If $h > 1$ then the optimal rate of aggregation for the $\phi_h$-risk is achieved by
the CAEW, with 'temperature' parameter $\beta_h$, and is equal to

$$\frac{\log M}{n}.$$



5 Suboptimality of Penalized ERM Procedures. 

In this Section we prove a lower bound under the margin assumption for any
selector and we give a more precise lower bound for penalized ERM procedures.
First, we recall the definition of the margin assumption introduced in [30].
Margin Assumption (MA): The probability measure $\pi$ satisfies the margin
assumption MA($\kappa$), where $\kappa \ge 1$, if we have

$$E[|f(X) - f^*(X)|] \le c (A_0(f) - A_0^*)^{1/\kappa}, \quad (8)$$

for any measurable function $f$ with values in $\{-1, 1\}$.

We denote by $\mathcal{P}_\kappa$ the set of all probability distributions $\pi$ satisfying MA($\kappa$).

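Theorem 3. Let $M \ge 2$ be an integer, $\kappa \ge 1$ be a real number, $\mathcal{X}$ be infinite
and $\phi : \mathbb{R} \to \mathbb{R}$ be a loss function such that $a_\phi \stackrel{\rm def}{=} \phi(-1) - \phi(1) > 0$. There
exists a family $\mathcal{F}$ of $M$ classifiers with values in $\{-1, 1\}$ satisfying the following.

Let $\tilde f_n$ be a selector with values in $\mathcal{F}$. Assume that $(\log M)/n \le 1/2$. There
exists a probability measure $\pi \in \mathcal{P}_\kappa$ and an absolute constant $C_3 > 0$ such that
$\tilde f_n$ satisfies

$$E[A^\phi(\tilde f_n) - A^\phi_*] \ge \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + C_3 \left(\frac{\log M}{n}\right)^{\kappa/(2\kappa - 1)}. \quad (9)$$

Consider the penalized ERM procedure $\tilde f_n^{pERM}$ associated with $\mathcal{F}$, defined by

$$\tilde f_n^{pERM} \in \mathrm{Arg}\min_{f \in \mathcal{F}} (A_n^\phi(f) + \mathrm{pen}(f)),$$

where the penalty function pen($\cdot$) satisfies $|\mathrm{pen}(f)| \le C \sqrt{(\log M)/n}$, $\forall f \in \mathcal{F}$,
with $0 \le C < \sqrt{2}/3$. Assume that $1188 \pi C^2 M^{9C^2} \log M \le n$. If $\kappa > 1$ then there
exists a probability measure $\pi \in \mathcal{P}_\kappa$ and an absolute constant $C_4 > 0$ such that
the penalized ERM procedure $\tilde f_n^{pERM}$ satisfies

$$E[A^\phi(\tilde f_n^{pERM}) - A^\phi_*] \ge \min_{f \in \mathcal{F}} (A^\phi(f) - A^\phi_*) + C_4 \sqrt{\frac{\log M}{n}}.$$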

Remark 1. Inspection of the proof shows that Theorem 3 is valid for any family
$\mathcal{F}$ of classifiers $f_1, \ldots, f_M$, with values in $\{-1, 1\}$, such that there exist points
$x_1, \ldots, x_{2^M}$ in $\mathcal{X}$ satisfying $\{(f_1(x_j), \ldots, f_M(x_j)) : j = 1, \ldots, 2^M\} = \{-1, 1\}^M$.

Remark 2. If we use a penalty function such that $|\mathrm{pen}(f)| \le \gamma n^{-1/2}$, $\forall f \in \mathcal{F}$,
where $\gamma > 0$ is an absolute constant (i.e. $0 \le C \le \gamma (\log M)^{-1/2}$), then the
condition "$1188 \pi C^2 M^{9C^2} \log M \le n$" of Theorem 3 is equivalent to "$n$ greater
than a constant".

Theorem 3 states that the ERM procedure (and even penalized ERM proce-
dures) cannot mimic the best classifier in $\mathcal{F}$ with rates faster than $((\log M)/n)^{1/2}$
if the basis classifiers in $\mathcal{F}$ are different enough, under a very mild condition on
the loss $\phi$. If there is no margin assumption (which corresponds to the case
$\kappa = +\infty$), the result of Theorem 3 can be easily deduced from the lower bound
in Chapter 7 of [11]. The main message of Theorem 3 is that such a negative
statement remains true even under the margin assumption MA($\kappa$). Selector
aggregates cannot mimic the oracle faster than $((\log M)/n)^{1/2}$ in general. Un-
der MA($\kappa$), they cannot mimic the best classifier in $\mathcal{F}$ with rates faster than
$((\log M)/n)^{\kappa/(2\kappa - 1)}$ (which is greater than $(\log M)/n$ when $\kappa > 1$). We know,
according to Theorem 1, that the CAEW procedure mimics the best classifier
in $\mathcal{F}$ at the rate $(\log M)/n$ if the loss is $\beta$-convex. Thus, penalized ERM pro-
cedures (and more generally, selectors) are suboptimal aggregation procedures
when the loss function is $\beta$-convex even if we add the constraint that $\pi$ satisfies
MA($\kappa$).

We can extend Theorem 3 to a more general framework [24] and we obtain
that, if the loss function associated with a risk is somewhat more than con-
vex then it is better to use aggregation procedures with exponential weights
instead of selectors (in particular penalized ERM or pure ERM). We do not
know whether the lower bound (9) is sharp, i.e., whether there exists a selector
attaining the reverse inequality with the same rate.
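To make the comparison of rates concrete, the sketch below evaluates the selector lower-bound rate $((\log M)/n)^{\kappa/(2\kappa-1)}$ next to the rate $(\log M)/n$ for a few values of $\kappa$. Constants are ignored, and the function names and the values of $n$ and $M$ are assumptions chosen for illustration.

```python
import math

def selector_lower_rate(n, M, kappa):
    """Rate ((log M) / n) ** (kappa / (2 * kappa - 1)), for kappa >= 1."""
    if kappa < 1:
        raise ValueError("kappa must be at least 1")
    return (math.log(M) / n) ** (kappa / (2 * kappa - 1))

def fast_rate(n, M):
    """Rate (log M) / n."""
    return math.log(M) / n

n, M = 10_000, 32  # hypothetical sample size and family size
for kappa in (1.0, 1.5, 2.0, 10.0):
    print(kappa, selector_lower_rate(n, M, kappa), fast_rate(n, M))
```

The exponent $\kappa/(2\kappa - 1)$ decreases from 1 at $\kappa = 1$ to $1/2$ as $\kappa \to \infty$, so the two rates coincide only at $\kappa = 1$ and the gap widens as the margin assumption weakens.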



6 Discussion. 



We proved in Theorem 2 that the ERM procedure is optimal only for non-convex
losses and for the borderline case of the hinge loss. But, for non-convex losses,
the implementation of the ERM procedure requires minimization of a function
which is not convex. This is hard to implement and not efficient from a practical
point of view. In conclusion, the ERM procedure is theoretically optimal only for
non-convex losses, where it is practically inefficient, and it is practically
efficient only in the cases where it is theoretically suboptimal.



For any convex loss $\phi$, we have $\frac{1}{n} \sum_{k=1}^n A^\phi(\tilde f_{k,\beta}^{AEW}) \ge A^\phi(\tilde f_{n,\beta}^{CAEW})$. Next,
fewer observations are used for the construction of $\tilde f_{k,\beta}^{AEW}$, $1 \le k \le n - 1$, than
for the construction of $\tilde f_{n,\beta}^{AEW}$. We can therefore expect the $\phi$-risk of $\tilde f_{n,\beta}^{AEW}$ to
be smaller than the $\phi$-risk of $\tilde f_{k,\beta}^{AEW}$ for all $1 \le k \le n - 1$ and hence smaller
than the $\phi$-risk of $\tilde f_{n,\beta}^{CAEW}$. Thus, the AEW procedure is likely to be an optimal
aggregation procedure for the convex loss functions.

The hinge loss happens to be really hinge for different reasons. For losses
"between" the 0-1 loss and the hinge loss ($0 \le h \le 1$), the ERM is an optimal
aggregation procedure and the optimal rate of aggregation is $\sqrt{(\log M)/n}$. For
losses "over" the hinge loss ($h > 1$), the ERM procedure is suboptimal and
$(\log M)/n$ is the optimal rate of aggregation. Thus, there is a breakdown point
in the optimal rate of aggregation just after the hinge loss. This breakdown can
be explained by the concept of margin: this argument has not been introduced
here for lack of space, but can be found in [24]. Moreover, for the hinge loss
we get, by linearity,

$$\min_{f \in \mathcal{C}} A_1(f) - A_1^* = \min_{f \in \mathcal{F}} A_1(f) - A_1^*,$$

where $\mathcal{C}$ is the convex hull of $\mathcal{F}$. Thus, for the particular case of the hinge loss,
"model selection" aggregation and "convex" aggregation are identical problems
(cf. [21] for more details).
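The linearity argument can be checked numerically on an empirical analogue: for predictions in $[-1, 1]$ and labels in $\{-1, 1\}$ the hinge loss equals $1 - yf(x)$, so the hinge risk of a convex combination is the same convex combination of the individual risks and can never fall below the smallest of them. The random data, the seed and the variable names below are assumptions made for this sketch, which uses empirical averages in place of the true risk $A_1$.

```python
import random

def hinge(u):
    return max(0.0, 1.0 - u)

def hinge_risk(values, labels):
    """Empirical hinge risk of a vector of predictions."""
    return sum(hinge(y * v) for v, y in zip(values, labels)) / len(labels)

rng = random.Random(0)
n, M = 50, 4

# Hypothetical labels and M prediction vectors with values in [-1, 1].
labels = [rng.choice([-1, 1]) for _ in range(n)]
preds = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(M)]
risks = [hinge_risk(p, labels) for p in preds]

# A random point of the convex hull of the M prediction vectors.
raw = [rng.random() for _ in range(M)]
weights = [r / sum(raw) for r in raw]
mixture = [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(n)]

mixture_risk = hinge_risk(mixture, labels)
weighted_risk = sum(w * r for w, r in zip(weights, risks))
print(mixture_risk, weighted_risk, min(risks))
```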



7 Proofs. 

Proof of Theorem 2. The optimal rates of aggregation of Theorem 2 are
achieved by the procedures introduced in Section 2. Depending on the value of
$h$, Theorem 1 provides the exact oracle inequalities required by the point (1) of
Definition 2. To show optimality of these rates of aggregation, we need only to
prove the corresponding lower bounds. We consider two cases: $0 \le h \le 1$ and
$h > 1$. Denote by $\mathcal{P}$ the set of all probability distributions on $\mathcal{X} \times \{-1, 1\}$.

Let $0 \le h \le 1$. It is easy to check that the Bayes rule $f^*$ is a minimizer of
the $\phi_h$-risk. Moreover, using the inequality $A_1(f) - A_1^* \ge A_0(f) - A_0^*$, which
holds for any real-valued function $f$ (cf. [34]), we have for any prediction rules
$f_1, \ldots, f_M$ (with values in $\{-1, 1\}$) and for any finite set $\mathcal{F}$ of $M$ real valued
functions,



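$$\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}} \Big( E[A_h(\hat f_n) - A_h^*] - \min_{f \in \mathcal{F}} (A_h(f) - A_h^*) \Big)$$

$$\ge \inf_{\hat f_n} \sup_{\pi \in \mathcal{P} : f^* \in \{f_1, \ldots, f_M\}} E[A_h(\hat f_n) - A_h^*]$$

$$\ge \inf_{\hat f_n} \sup_{\pi \in \mathcal{P} : f^* \in \{f_1, \ldots, f_M\}} E[A_0(\hat f_n) - A_0^*]. \quad (10)$$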



Let $N$ be an integer such that $2^{N-1} \le M$, $x_1, \ldots, x_N$ be $N$ distinct points
of $\mathcal{X}$ and $w$ be a positive number satisfying $(N - 1) w \le 1$. Denote by $P^X$
the probability measure on $\mathcal{X}$ such that $P^X(\{x_j\}) = w$, for $j = 1, \ldots, N - 1$
and $P^X(\{x_N\}) = 1 - (N - 1) w$. We consider the cube $\Omega = \{-1, 1\}^{N-1}$. Let
$0 < \hbar < 1$. For all $\sigma = (\sigma_1, \ldots, \sigma_{N-1}) \in \Omega$ we consider

$$\eta_\sigma(x) = \begin{cases} (1 + \sigma_j \hbar)/2 & \text{if } x = x_j, \ j = 1, \ldots, N - 1, \\ 1 & \text{if } x = x_N. \end{cases}$$

For all $\sigma \in \Omega$ we denote by $\pi_\sigma$ the probability measure on $\mathcal{X} \times \{-1, 1\}$ defined
by its marginal $P^X$ on $\mathcal{X}$ and its conditional probability function $\eta_\sigma$.

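We denote by $\rho$ the Hamming distance on $\Omega$. Let $\sigma, \sigma' \in \Omega$ be such that
$\rho(\sigma, \sigma') = 1$. Denote by $H$ the Hellinger distance. Since
$H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) = 2\big(1 - (1 - H^2(\pi_\sigma, \pi_{\sigma'})/2)^n\big)$
and $H^2(\pi_\sigma, \pi_{\sigma'}) = 2w(1 - \sqrt{1 - \hbar^2})$, the
Hellinger distance between the measures $\pi_\sigma^{\otimes n}$ and $\pi_{\sigma'}^{\otimes n}$ satisfies

$$H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) = 2\Big(1 - \big(1 - w(1 - \sqrt{1 - \hbar^2})\big)^n\Big).$$

Take $w$ and $\hbar$ such that $w(1 - \sqrt{1 - \hbar^2}) \le n^{-1}$. Then, $H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) \le
2(1 - e^{-1}) < 2$ for any integer $n$.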
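The two Hellinger identities used above, $H^2(\pi_\sigma, \pi_{\sigma'}) = 2w(1 - \sqrt{1 - \hbar^2})$ for neighbouring vertices and $H^2(P^{\otimes n}, Q^{\otimes n}) = 2(1 - (1 - H^2(P, Q)/2)^n)$, can be verified by brute force on a small instance. The values of $w$, $\hbar$, $N$ and $n$, the name `hbar` for the noise parameter and the helper names are assumptions made for this sketch.

```python
import math
from itertools import product

def hellinger_sq(p, q):
    """Squared Hellinger distance between two finite distributions."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def pi_sigma(sigma, w, hbar):
    """Masses of (x_j, +1), (x_j, -1) for j < N, then (x_N, +1), (x_N, -1)."""
    masses = []
    for s in sigma:
        eta = (1 + s * hbar) / 2
        masses += [w * eta, w * (1 - eta)]
    rest = 1 - len(sigma) * w        # mass of x_N, where eta equals 1
    masses += [rest, 0.0]
    return masses

def tensor(p, n):
    """Masses of the n-fold product distribution, in a fixed enumeration order."""
    return [math.prod(cell) for cell in product(p, repeat=n)]

w, hbar, n = 0.1, 0.3, 3             # hypothetical values, with N = 4
p = pi_sigma((1, -1, 1), w, hbar)
q = pi_sigma((1, 1, 1), w, hbar)     # differs from p in one coordinate

single = hellinger_sq(p, q)
closed_form = 2 * w * (1 - math.sqrt(1 - hbar ** 2))
product_form = hellinger_sq(tensor(p, n), tensor(q, n))
print(single, closed_form, product_form)
```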

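Let $\sigma \in \Omega$ and $\hat f_n$ be an estimator with values in $\{-1, 1\}$ (only the sign of a
statistic is used when we work with the 0-1 loss). For $\pi = \pi_\sigma$, we have

$$E_{\pi_\sigma}[A_0(\hat f_n) - A_0^*] \ge \hbar w E_{\pi_\sigma}\Big[\sum_{j=1}^{N-1} |\hat f_n(x_j) - \sigma_j|\Big].$$

Using Assouad's Lemma (cf. Lemma 1), we obtain

$$\inf_{\hat f_n} \sup_{\sigma \in \Omega} E_{\pi_\sigma}[A_0(\hat f_n) - A_0^*] \ge \hbar w \frac{N - 1}{4 e^2}. \quad (11)$$

Take now $w = (n \hbar^2)^{-1}$, $N = \lceil \log M / \log 2 \rceil$, $\hbar = (n^{-1} \lceil \log M / \log 2 \rceil)^{1/2}$. We
complete the proof by replacing $w$, $\hbar$ and $N$ in (11) and (10) by their values.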

For the case $h > 1$, we consider an integer $N$ such that $2^{N-1} \le M$, $N$
distinct points $x_1, \ldots, x_N$ of $\mathcal{X}$ and a positive number $w$ such that $(N - 1) w \le 1$.
We denote by $P^X$ the probability measure on $\mathcal{X}$ such that $P^X(\{x_j\}) = w$
for $j = 1, \ldots, N - 1$ and $P^X(\{x_N\}) = 1 - (N - 1) w$. Denote by $\Omega$ the cube
$\{-1, 1\}^{N-1}$. For any $\sigma \in \Omega$ and $h > 1$, we consider the conditional probability
function $\eta_\sigma$ in two different cases. If $2(h - 1) \le 1$ we take

$$\eta_\sigma(x) = \begin{cases} (1 + 2\sigma_j (h - 1))/2 & \text{if } x = x_j, \ j = 1, \ldots, N - 1, \\ 2(h - 1) & \text{if } x = x_N, \end{cases}$$

and if $2(h - 1) > 1$ we take

$$\eta_\sigma(x) = \begin{cases} (1 + \sigma_j)/2 & \text{if } x = x_j, \ j = 1, \ldots, N - 1, \\ 1 & \text{if } x = x_N. \end{cases}$$

For all $\sigma \in \Omega$ we denote by $\pi_\sigma$ the probability measure on $\mathcal{X} \times \{-1, 1\}$ with the
marginal $P^X$ on $\mathcal{X}$ and the conditional probability function $\eta_\sigma$ of $Y$ knowing $X$.

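Consider

$$\rho(h) = \begin{cases} 1 & \text{if } 2(h - 1) \le 1, \\ (4(h - 1))^{-1} & \text{if } 2(h - 1) > 1, \end{cases} \quad \text{and} \quad g^*_\sigma(x) = \begin{cases} \sigma_j & \text{if } x = x_j, \ j = 1, \ldots, N - 1, \\ 1 & \text{if } x = x_N. \end{cases}$$

A minimizer of the $\phi_h$-risk when the underlying distribution is $\pi_\sigma$ is given by

$$f^*_{h,\sigma}(x) = \rho(h) g^*_\sigma(x), \quad \forall x \in \mathcal{X},$$

for any $h > 1$ and $\sigma \in \Omega$.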

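When we choose $\{f^*_{h,\sigma} : \sigma \in \Omega\}$ for the set $\mathcal{F} = \{f_1, \ldots, f_M\}$ of basis
functions, we obtain

$$\sup_{\{f_1, \ldots, f_M\}} \inf_{\hat f_n} \sup_{\pi \in \mathcal{P}} \Big( E[A_h(\hat f_n) - A_h^*] - \min_{j = 1, \ldots, M} (A_h(f_j) - A_h^*) \Big)$$

$$\ge \inf_{\hat f_n} \sup_{\pi \in \mathcal{P} : f_h^* \in \{f^*_{h,\sigma} : \sigma \in \Omega\}} \big( E[A_h(\hat f_n) - A_h^*] \big).$$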



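Let $\sigma$ be an element of $\Omega$. Under the probability distribution $\pi_\sigma$, we have
$A_h(f) - A_h^* = (h - 1) E[(f(X) - f^*_{h,\sigma}(X))^2]$, for any real-valued function $f$ on $\mathcal{X}$. Thus,
for a real valued estimator $\hat f_n$ based on $D_n$, we have

$$A_h(\hat f_n) - A_h^* \ge (h - 1) w \sum_{j=1}^{N-1} (\hat f_n(x_j) - \rho(h) \sigma_j)^2.$$

We consider the projection function $\psi_h(x) = \psi(x/\rho(h))$ for any $x \in \mathbb{R}$, where
$\psi(y) = \max(-1, \min(1, y))$, $\forall y \in \mathbb{R}$. We have

$$E_\sigma[A_h(\hat f_n) - A_h^*] \ge w (h - 1) \sum_{j=1}^{N-1} E_\sigma \big(\rho(h) \psi_h(\hat f_n(x_j)) - \rho(h) \sigma_j\big)^2$$

$$\ge w (h - 1) (\rho(h))^2 \sum_{j=1}^{N-1} E_\sigma \big(\psi_h(\hat f_n(x_j)) - \sigma_j\big)^2$$

$$\ge 4 w (h - 1) (\rho(h))^2 \inf_{\hat\sigma \in [0,1]^{N-1}} \max_{\sigma \in \{0,1\}^{N-1}} E_\sigma \Big[\sum_{j=1}^{N-1} |\hat\sigma_j - \sigma_j|^2\Big],$$

where the infimum $\inf_{\hat\sigma \in [0,1]^{N-1}}$ is taken over all estimators $\hat\sigma$ based on one obser-
vation from the statistical experience $\{\pi_\sigma^{\otimes n} \,|\, \sigma \in \{0,1\}^{N-1}\}$ and with values in $[0, 1]^{N-1}$.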

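For any $\sigma, \sigma' \in \Omega$ such that $\rho(\sigma, \sigma') = 1$, the Hellinger distance between
the measures $\pi_\sigma^{\otimes n}$ and $\pi_{\sigma'}^{\otimes n}$ satisfies

$$H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) = \begin{cases} 2\Big(1 - \big(1 - 2w(1 - \sqrt{1 - 4(h-1)^2})\big)^n\Big) & \text{if } 2(h - 1) \le 1, \\ 2\Big(1 - \big(1 - 2w(1 - \sqrt{3/4})\big)^n\Big) & \text{if } 2(h - 1) > 1. \end{cases}$$

We take

$$w = \begin{cases} (2n(h - 1)^2)^{-1} & \text{if } 2(h - 1) \le 1, \\ 8 n^{-1} & \text{if } 2(h - 1) > 1. \end{cases}$$

Thus, we have for any $\sigma, \sigma' \in \Omega$ such that $\rho(\sigma, \sigma') = 1$,

$$H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) \le 2(1 - e^{-1}).$$

To complete the proof we apply Lemma 1 with $N = \lceil (\log M)/\log 2 \rceil$.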
Proof of Theorem 3. Consider $\mathcal{F}$ a family of classifiers $f_1, \ldots, f_M$, with
values in $\{-1, 1\}$, such that there exist $2^M$ points $x_1, \ldots, x_{2^M}$ in $\mathcal{X}$ satisfying

$$\{(f_1(x_j), \ldots, f_M(x_j)) : j = 1, \ldots, 2^M\} = \{-1, 1\}^M \stackrel{\rm def}{=} S_M.$$

Consider the lexicographic order on $S_M$:

$$(-1, \ldots, -1) \preccurlyeq (-1, \ldots, -1, 1) \preccurlyeq (-1, \ldots, -1, 1, -1) \preccurlyeq \cdots \preccurlyeq (1, \ldots, 1).$$

Take $j$ in $\{1, \ldots, 2^M\}$ and denote by $x'_j$ the element in $\{x_1, \ldots, x_{2^M}\}$ such that
$(f_1(x'_j), \ldots, f_M(x'_j))$ is the $j$-th element of $S_M$ for the lexicographic order. We
denote by $\varphi$ the bijection between $S_M$ and $\{x_1, \ldots, x_{2^M}\}$ such that the value
of $\varphi$ at the $j$-th element of $S_M$ is $x'_j$. By using the bijection $\varphi$ we can work
independently either on the set $S_M$ or on $\{x_1, \ldots, x_{2^M}\}$. Without any assumption
on the space $\mathcal{X}$, we consider, in what follows, functions and probability measures
on $S_M$. Remark that for the bijection $\varphi$ we have

$$f_j(\varphi(x)) = x^j, \quad \forall x = (x^1, \ldots, x^M) \in S_M, \ \forall j \in \{1, \ldots, M\}.$$

With a slight abuse of notation, we still denote by $\mathcal{F}$ the set of functions
$f_1, \ldots, f_M$ defined by $f_j(x) = x^j$, for any $j = 1, \ldots, M$.

First remark that for any $f, g$ from $\mathcal{X}$ to $\{-1, 1\}$, using $E[\phi(Y f(X)) | X] =
E[\phi(Y) | X] \mathbb{1}_{(f(X) = 1)} + E[\phi(-Y) | X] \mathbb{1}_{(f(X) = -1)}$, we have

$$E[\phi(Y f(X)) | X] - E[\phi(Y g(X)) | X] = a_\phi (1/2 - \eta(X)) (f(X) - g(X)).$$

Hence, we obtain $A^\phi(f) - A^\phi(g) = a_\phi (A_0(f) - A_0(g))$. So, we have for any
$j = 1, \ldots, M$,

$$A^\phi(f_j) - A^\phi(f^*) = a_\phi (A_0(f_j) - A_0^*).$$

Moreover, for any $f : S_M \to \{-1, 1\}$ we have $A_n^\phi(f) = \phi(1) + a_\phi A_n^{\phi_0}(f)$ and
$a_\phi > 0$ by assumption, hence,

$$\tilde f_n^{pERM} \in \mathrm{Arg}\min_{f \in \mathcal{F}} (A_n^{\phi_0}(f) + \mathrm{pen}(f)).$$

Thus, it suffices to prove Theorem 3 when the loss function is the classical
0-1 loss function $\phi_0$.

We denote by $S_{M+1}$ the set $\{-1, 1\}^{M+1}$ and by $X^0, \ldots, X^M$, $M + 1$ inde-
pendent random variables with values in $\{-1, 1\}$ such that $X^0$ is distributed
according to a Bernoulli $\mathcal{B}(w, 1)$ with parameter $w$ (that is $P(X^0 = 1) = w$ and
$P(X^0 = -1) = 1 - w$) and the $M$ other variables $X^1, \ldots, X^M$ are distributed
according to a Bernoulli $\mathcal{B}(1/2, 1)$. The parameter $0 \le w \le 1$ will be chosen
wisely in what follows.

For any $j \in \{1, \ldots, M\}$, we consider the probability distribution $\pi_j =
(P^X, \eta^{(j)})$ of a couple of random variables $(X, Y)$ with values in $S_{M+1} \times \{-1, 1\}$,
where $P^X$ is the probability distribution on $S_{M+1}$ of $X = (X^0, \ldots, X^M)$ and
$\eta^{(j)}(x)$ is the regression function at the point $x \in S_{M+1}$, of $Y = 1$ knowing that
$X = x$, given by

$$\eta^{(j)}(x) = \begin{cases} 1 & \text{if } x^0 = 1, \\ 1/2 + h/2 & \text{if } x^0 = -1, \ x^j = -1, \\ 1/2 + h & \text{if } x^0 = -1, \ x^j = 1, \end{cases} \quad \forall x = (x^0, x^1, \ldots, x^M) \in S_{M+1},$$

where $h > 0$ is a parameter chosen wisely in what follows. The Bayes rule $f^*$,
associated with the distribution $\pi_j = (P^X, \eta^{(j)})$, is identically equal to 1 on
$S_{M+1}$.

If the probability distribution of $(X, Y)$ is $\pi_j$ for a $j \in \{1, \ldots, M\}$ then, for
any $0 < t < 1$, we have $P[|2\eta(X) - 1| \le t] \le (1 - w) \mathbb{1}_{h \le t}$. Now, we take

$$1 - w = h^{1/(\kappa - 1)},$$

then, we have $P[|2\eta(X) - 1| \le t] \le t^{1/(\kappa - 1)}$ and so $\pi_j \in \mathcal{P}_\kappa$.

We extend the definition of the $f_j$'s to the set $S_{M+1}$ by $f_j(x) = x^j$ for
any $x = (x^0, \ldots, x^M) \in S_{M+1}$ and $j = 1, \ldots, M$. Consider $\mathcal{F} = \{f_1, \ldots, f_M\}$.
Assume that $(X, Y)$ is distributed according to $\pi_j$ for a $j \in \{1, \ldots, M\}$. For any
$k \in \{1, \ldots, M\}$ and $k \ne j$, we have
$$A_0(f_k) - A_0^* = \sum_{x \in S_{M+1}} |\eta(x) - 1/2| \, |f_k(x) - 1| \, P[X = x] = \frac{3h(1 - w)}{8} + \frac{w}{2}$$

and the excess risk of $f_j$ is given by $A_0(f_j) - A_0^* = (1 - w)h/4 + w/2$. Thus, we
have

$$\min_{f \in \mathcal{F}} A_0(f) - A_0^* = A_0(f_j) - A_0^* = (1 - w)h/4 + w/2.$$

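First, we prove the lower bound for any selector. Let $\tilde f_n$ be a selector with
values in $\mathcal{F}$. If the underlying probability measure is $\pi_j$ for a $j \in \{1, \ldots, M\}$
then,

$$E_n^{(j)}[A_0(\tilde f_n) - A_0^*] = \sum_{k=1}^M (A_0(f_k) - A_0^*) \, \pi_j^{\otimes n}[\tilde f_n = f_k]$$

$$= \min_{f \in \mathcal{F}} (A_0(f) - A_0^*) + \frac{h(1 - w)}{8} \pi_j^{\otimes n}[\tilde f_n \ne f_j],$$

where $E_n^{(j)}$ denotes the expectation w.r.t. the observations $D_n$ when $(X, Y)$ is
distributed according to $\pi_j$. Hence, we have

$$\max_{1 \le j \le M} \Big\{ E_n^{(j)}[A_0(\tilde f_n) - A_0^*] - \min_{f \in \mathcal{F}} (A_0(f) - A_0^*) \Big\} \ge \frac{h(1 - w)}{8} \inf_{\hat\phi_n} \max_{1 \le j \le M} \pi_j^{\otimes n}[\hat\phi_n \ne j],$$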



where the infimum $\inf_{\hat\phi_n}$ is taken over all tests valued in $\{1, \ldots, M\}$ constructed
from one observation in the model $(S_{M+1} \times \{-1, 1\}, \mathcal{A} \times \mathcal{T}, \{\pi_1, \ldots, \pi_M\})^{\otimes n}$,
where $\mathcal{T}$ is the natural $\sigma$-algebra on $\{-1, 1\}$. Moreover, for any $j \in \{1, \ldots, M\}$,
we have

$$K(\pi_j^{\otimes n} | \pi_1^{\otimes n}) \le \frac{n h^2}{4(1 - h - 2h^2)},$$

where $K(P|Q)$ is the Kullback-Leibler divergence between $P$ and $Q$ (that is
$\int \log(dP/dQ) \, dP$ if $P \ll Q$ and $+\infty$ otherwise). Thus, if we apply Lemma 2
with $h = ((\log M)/n)^{(\kappa - 1)/(2\kappa - 1)}$, we obtain the result.

Second, we prove the lower bound for the pERM procedure $\hat f_n = \tilde f_n^{pERM}$.
Now, we assume that the probability distribution of $(X, Y)$ is $\pi_M$ and we take

$$h = \Big(C \sqrt{\frac{\log M}{n}}\Big)^{(\kappa - 1)/\kappa}. \quad (12)$$

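We have $E[A_0(\hat f_n) - A_0^*] = \min_{f \in \mathcal{F}} (A_0(f) - A_0^*) + \frac{h(1 - w)}{8} P[\hat f_n \ne f_M]$. Now, we
upper bound $P[\hat f_n = f_M]$, conditionally to $\mathbf{Y} = (Y_1, \ldots, Y_n)$. We have

$$P[\hat f_n = f_M | \mathbf{Y}]$$

$$= P[\forall j = 1, \ldots, M - 1, \ A_n^{\phi_0}(f_M) + \mathrm{pen}(f_M) \le A_n^{\phi_0}(f_j) + \mathrm{pen}(f_j) | \mathbf{Y}]$$

$$= P[\forall j = 1, \ldots, M - 1, \ \nu_M \le \nu_j + n(\mathrm{pen}(f_j) - \mathrm{pen}(f_M)) | \mathbf{Y}],$$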

where $\nu_j = \sum_{i=1}^n \mathbb{1}_{(Y_i X_i^j \le 0)}$, $\forall j = 1, \ldots, M$ and $X_i = (X_i^j)_{j = 0, \ldots, M} \in S_{M+1}$, $\forall i =
1, \ldots, n$. Moreover, the coordinates $X_i^j$, $i = 1, \ldots, n$; $j = 0, \ldots, M$ are inde-
pendent, $Y_1, \ldots, Y_n$ are independent of $X_i^j$, $i = 1, \ldots, n$; $j = 1, \ldots, M - 1$ and
$|\mathrm{pen}(f_j)| \le h^{\kappa/(\kappa - 1)}$, $\forall j = 1, \ldots, M$. So, we have

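$$P[\hat f_n = f_M | \mathbf{Y}] = \sum_{k=0}^n P[\nu_M = k | \mathbf{Y}] \prod_{j=1}^{M-1} P[k \le \nu_j + n(\mathrm{pen}(f_j) - \mathrm{pen}(f_M)) | \mathbf{Y}]$$

$$\le \sum_{k=0}^n P[\nu_M = k | \mathbf{Y}] \big(P[k \le \nu_1 + 2n h^{\kappa/(\kappa - 1)} | \mathbf{Y}]\big)^{M-1}$$

$$\le P[\nu_M \le \bar k | \mathbf{Y}] + \big(P[\bar k \le \nu_1 + 2n h^{\kappa/(\kappa - 1)} | \mathbf{Y}]\big)^{M-1},$$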

where

$$\bar k = E[\nu_M | \mathbf{Y}] - 2n h^{\kappa/(\kappa - 1)}$$

= 2 ^ CaTsS^- 1 ' + 1 + M/C-i>(3fc/4 - 1/2) ^=i)J - 2 ^ 
Using Einmahl and Mason's concentration inequality (cf. [12]), we obtain

$$P[\nu_M \le \bar k | \mathbf{Y}] \le \exp(-2n h^{2\kappa/(\kappa - 1)}).$$



Using Berry-Esseen's theorem (cf. p. 471 in [?]), the fact that $\mathbf{Y}$ is independent
of $(X_i^j; 1 \le i \le n, 1 \le j \le M - 1)$ and $\bar k \ge n/2 - 9n h^{\kappa/(\kappa - 1)}/4$, we get



$$P[\bar k \le \nu_1 + 2n h^{\kappa/(\kappa - 1)} | \mathbf{Y}] \le \Phi(6 h^{\kappa/(\kappa - 1)} \sqrt{n}) + \frac{66}{\sqrt{n}},$$

where $\Phi$ stands for the standard normal distribution function. Thus, we have

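$$E[A_0(\hat f_n) - A_0^*] \ge \min_{f \in \mathcal{F}} (A_0(f) - A_0^*) + \frac{(1 - w) h}{8} \Big[ 1 - \exp(-2n h^{2\kappa/(\kappa - 1)}) - \Big( \Phi(6 h^{\kappa/(\kappa - 1)} \sqrt{n}) + \frac{66}{\sqrt{n}} \Big)^{M-1} \Big]. \quad (13)$$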



Next, for any $a > 0$, by the elementary properties of the tails of normal
distribution, we have

$$1 - \Phi(a) = \frac{1}{\sqrt{2\pi}} \int_a^{+\infty} \exp(-t^2/2) \, dt \ge \frac{a}{\sqrt{2\pi}(a^2 + 1)} \exp(-a^2/2). \quad (14)$$
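The Gaussian tail bound can be sanity-checked numerically. The sketch below assumes the standard form $1 - \Phi(a) \ge \frac{a}{\sqrt{2\pi}(a^2+1)} e^{-a^2/2}$ for $a > 0$ and computes the exact tail from the complementary error function; the function names and the test points are choices made for this example.

```python
import math

def normal_tail(a):
    """Exact upper tail 1 - Phi(a) of the standard normal distribution."""
    return 0.5 * math.erfc(a / math.sqrt(2))

def tail_lower_bound(a):
    """Lower bound a * exp(-a^2 / 2) / (sqrt(2 * pi) * (a^2 + 1)), for a > 0."""
    return a * math.exp(-a * a / 2) / (math.sqrt(2 * math.pi) * (a * a + 1))

for a in (0.1, 0.5, 1.0, 2.0, 3.0, 5.0):
    print(a, normal_tail(a), tail_lower_bound(a))
```

The bound is loose for small $a$ and becomes tight as $a$ grows, since the ratio of the two sides is $(a^2 + 1)/a$ times Mills' ratio, which tends to 1.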



Besides, we have $0 < C < \sqrt{2}/6$ (a modification for $C = 0$ is obvious) and
$(3376 C)^2 (2\pi M^{36 C^2} \log M) \le n$; thus, if we replace $h$ by its value given in (12)
and if we apply (14) with $a = 6C\sqrt{\log M}$, then we obtain



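$$\Big( \Phi(6C\sqrt{\log M}) + \frac{66}{\sqrt{n}} \Big)^{M-1} \le \exp\Big( -\frac{M^{1 - 18C^2}}{18 C \sqrt{2\pi \log M}} + \frac{66(M - 1)}{\sqrt{n}} \Big). \quad (15)$$

Combining (13) and (15), we obtain the result with

$$C_4 = (C/4) \Big( 1 - \exp(-8C^2) - \exp\big(-1/(36 C \sqrt{2\pi \log 2})\big) \Big) > 0.$$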



The following lemma is used to establish the lower bounds of Theorem 2. It
is a version of Assouad's Lemma (cf. [28]). A proof can be found in [24].

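Lemma 1. Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. Consider a set of probability
measures $\{P_\omega \,|\, \omega \in \Omega\}$ indexed by the cube $\Omega = \{0, 1\}^m$. Denote by $E_\omega$ the expectation
under $P_\omega$. Let $\theta \ge 1$ be a number. Assume that:

$$\forall \omega, \omega' \in \Omega \text{ such that } \rho(\omega, \omega') = 1, \quad H^2(P_\omega, P_{\omega'}) \le \alpha < 2,$$

then we have

$$\inf_{\hat w \in [0,1]^m} \max_{\omega \in \Omega} E_\omega \Big[ \sum_{j=1}^m |\hat w_j - \omega_j|^\theta \Big] \ge m 2^{-3-\theta} (2 - \alpha)^2,$$

where the infimum $\inf_{\hat w \in [0,1]^m}$ is taken over all estimators based on an observation
from the statistical experience $\{P_\omega \,|\, \omega \in \Omega\}$ and with values in $[0, 1]^m$.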


We use the following lemma to prove the weakness of selector aggregates. A
proof can be found p. 84 in [28].

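Lemma 2. Let $P_1, \ldots, P_M$ be $M$ probability measures on a measurable space
$(\mathcal{Z}, \mathcal{T})$ satisfying $\frac{1}{M} \sum_{j=1}^M K(P_j | P_1) \le \alpha \log M$, where $0 < \alpha < 1/8$. We have

$$\inf_{\hat\phi} \max_{1 \le j \le M} P_j(\hat\phi \ne j) \ge \frac{\sqrt{M}}{1 + \sqrt{M}} \Big( 1 - 2\alpha - 2\sqrt{\frac{\alpha}{\log 2}} \Big),$$

where the infimum $\inf_{\hat\phi}$ is taken over all tests $\hat\phi$ with values in $\{1, \ldots, M\}$ con-
structed from one observation in the statistical model $(\mathcal{Z}, \mathcal{T}, \{P_1, \ldots, P_M\})$.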